Technical  Report 


Department  of  Computer  Science 
and  Engineering 
University  of  Minnesota 
4-192  EECS  Building 
200  Union  Street  SE 
Minneapolis,  MN  55455-0159  USA 


TR  06-010 


Incremental  Window-based  Protein  Sequence  Alignment  Algorithms 
Huzefa  Rangwala  and  George  Karypis 


March  23,  2006 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 

1.  REPORT  DATE 

23  MAR  2006  2  REPORT  TYPE 

3.  DATES  COVERED 

00-00-2006  to  00-00-2006 

4.  TITLE  AND  SUBTITLE 

Incremental  Window-based  Protein  Sequence  Alignment  Algorithms 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Department  of  Computer  Science  and  Engineering, University  of 
Minnesota, 200  Union  Street  SE, Minneapolis, MN, 55455-0159 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF:  17.  LIMITATION  OF 

18.  NUMBER  19a.  NAME  OF 

a.  REPORT  b.  ABSTRACT  c.  THIS  PAGE 

unclassified  unclassified  unclassified 

10 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


Incremental  Window-based  Protein  Sequence  Alignment  Algorithms 

Huzefa  Rangwala  and  George  Karypis 

Department  of  Computer  Science  &  Engineering,  University  of  Minnesota,Minneapolis,  MN  55455 

{rangwala,karypis}  @cs.umn.edu 


Abstract 


Motivation:  Protein  sequence  alignment  plays  a  critical  role 

in  computational  biology  as  it  is  an  integral  part  in  many  analysis 
tasks  designed  to  solve  problems  in  comparative  genomics,  structure 
and  function  prediction,  and  homology  modeling. 

Msthods:  We  have  developed  novel  sequence  alignment  algo¬ 

rithms  that  compute  the  alignment  between  a  pair  of  sequences 
based  on  short  fixed-  or  variable-length  high-scoring  subsequences. 
Our  algorithms  build  the  alignments  by  repeatedly  selecting  the 
highest  scoring  pairs  of  subsequences  and  using  them  to  construct 
small  portions  of  the  final  alignment.  We  utilize  PSI-BLAST  gen¬ 
erated  sequence  profiles  and  employ  a  profile-to-profile  scoring 
scheme  derived  from  PICASSO. 

Results:  We  evaluated  the  performance  of  the  computed  align¬ 

ments  on  two  recently  published  benchmark  datasets  and  compared 
them  against  the  alignments  computed  by  existing  state-of-the-art 
dynamic  programming-based  profile-to-profile  local  and  global  se¬ 
quence  alignment  algorithms.  Our  results  show  that  the  new  algo¬ 
rithms  achieve  alignments  that  are  comparable  or  better  to  those 
achieved  by  existing  algorithms.  Moreover,  our  results  also  showed 
that  these  algorithms  can  be  used  to  provide  better  information  as  to 
which  of  the  aligned  positions  are  more  reliable — a  critical  piece  of 
information  for  comparative  modeling  applications. 

Suppl.  Data  http://bioinfo.cs.umn.edu/supplements/win-aln/ 

1  Introduction 

Alignment  algorithms  serve  as  the  most  basic  sequence  anal¬ 
ysis  methods  in  computational  biology  and  have  a  wide  range 
of  applications  dealing  with  sequence  database  searching, 
comparative  modeling,  protein  structure  and  function  predic¬ 
tion. 

The  current  state-of-the-art  sequence  alignment  algo¬ 
rithms  have  a  well  defined  optimal  dynamic  programming 
based  solution,  introduced  decades  ago.  These  optimal  algo¬ 
rithms,  Smith- Waterman  [35]  and  Needleman-Wunsch  [27] 
solve  the  local  and  global  sequence  alignment  problems  re¬ 
spectively.  Over  the  years,  alignment  methods  have  advanced 
with  several  variations  of  the  optimal  alignment  method,  use 
of  gap  modeling  techniques  [13],  heuristics  [1,  29],  and  more 
recently  the  use  of  profile  [12,  7,  2]  and  structure  informa¬ 
tion  [18]. 

In  recent  years,  there  has  been  a  considerable  research 


effort  in  developing  kernel-based  methods  for  building  dis¬ 
criminatory  models  for  remote  homology  detection  and  fold 
recognition.  This  research  has  led  to  the  development  of  a 
number  of  protein  string  kernels  that  determine  the  similar¬ 
ity  between  a  pair  of  proteins  as  a  function  of  the  number  of 
sufficiently  similar  short  subsequences  that  they  share.  These 
string  kernels  have  proven  to  be  extremely  effective  in  build¬ 
ing  very  accurate  models,  and  these  methods  are  among  the 
best  performing  schemes  for  remote  homology  prediction  and 
fold  recognition  [22,  21,  30]  . 

Motivated  by  these  developments  in  string  kernels,  the 
work  in  this  paper  is  designed  to  address  the  question  as  to 
the  extent  to  which,  ideas  motivated  by  these  string  kernels 
can  be  used  to  build  alignments  between  a  pair  of  sequences. 
Toward  this  goal,  we  developed  a  set  of  window-based  align¬ 
ment  algorithms  that  are  heuristic  in  nature.  Our  methods 
incrementally  constructed  the  alignment  by  using  the  highest 
scoring  pairs  of  residues  between  the  two  sequences  at  each 
step.  The  residue  pair  scoring  was  borrowed  from  string  ker¬ 
nel  theory  where  to  score  the  residue  pairs  in  consideration, 
we  examined  short  subsequences,  referred  to  a  tenters  cen¬ 
tered  around  each  of  the  two  residues.  We  introduced  several 
heuristics  to  identify  aligned  residue  pairs  using  the  turners 
coupled  with  profile  information. 

We  determined  the  quality  of  our  alignment  methods 
by  evaluation  on  a  template-based  [7,  31]  and  a  model- 
based  dataset  [8,  5],  Our  empirical  results  on  the  two 
datasets  showed  the  competitive  performance  of  our  intro¬ 
duced  schemes  to  state-of-the-art  methods.  We  also  evaluated 
our  methods  by  determining  the  reliability  of  the  aligned  po¬ 
sitions  [17, 4,  32,  25,  36].  The  positive  results  for  some  of  our 
alignment  algorithms  on  such  a  reliability  metric  is  very  en¬ 
couraging  due  to  far  reaching  applications,  like  comparative 
modeling. 

2  Methods 

2.1  Sequence  Profiles  and  Profile  Scoring 

The  alignment  algorithms  that  we  developed  take  advantage 
of  evolutionary  information  by  utilizing  PSI-BLAST  [2]  gen¬ 
erated  sequence  profiles. 

The  profile  of  a  sequence  X  of  length  m  is  represented  by 
two  m  x  20  matrices.  The  first  is  its  position-specific  scoring 
matrix  PSSMx  that  is  computed  directly  by  PSI-BLAST  us¬ 
ing  the  scheme  described  in  [2],  The  rows  of  this  matrix  cor- 
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respond  to  the  various  positions  in  X  and  the  columns  corre¬ 
spond  to  the  20  distinct  amino  acids.  The  second  matrix  is  its 
position-specific  frequency  matrix  PSFMx  that  contains  the 
frequencies  used  by  PSI-BLAST  to  derive  PSSMx-  These 
frequencies  (also  referred  to  as  target  frequencies  [26])  con¬ 
tain  both  the  sequence-weighted  observed  frequencies  (also 
referred  to  as  effective  frequencies  [26])  and  the  BLOSUM62 
[15]  derived-pseudocounts  [2], 

Many  different  schemes  have  been  developed  for  deter¬ 
mining  the  similarity  between  profiles  that  combine  infor¬ 
mation  from  the  original  sequence,  position-specific  scoring 
matrix,  or  position-specific  target  and/or  effective  frequen¬ 
cies  [26,  37,  24],  In  our  work  we  use  a  scheme  that  is  derived 
from  PICASSO  [14,  26]  that  was  recently  used  in  developing 
effective  remote  homology  detection  and  fold  recognition  al¬ 
gorithms  [30],  Specifically,  the  similarity  score  between  the 
Zth  position  of  protein’s  X  profile,  and  the  j th  position  of  pro¬ 
tein’s  Y  profile  is  given  by 

20 

Sx,Y(i,j)  =  £  PSFM*(M)PSSMy( j,l)  + 

i=i 

(1) 

20 

£PSFMy(j,Z)PSSM*  (i,Z), 

i=i 

where  PSFMx(M)  and  PSSMx(M)  are  the  values  cor¬ 
responding  to  the  Zth  amino  acid  at  the  Zth  position 
of  X’s  position-specific  scoring  and  frequency  matrices. 
PSFMy  (j,  Z)  and  PSSMy(j,  Z)  are  defined  in  a  similar  fash¬ 
ion. 

2.2  Window-based  Alignments 

The  overall  methodology  of  the  alignment  algorithms  devel¬ 
oped  in  this  work  is  to  incrementally  construct  the  alignment 
by  using  various  heuristics  to  identify  the  pairs  of  aligned 
residues.  The  key  idea  shared  by  these  algorithms  is  that  they 
determine  whether  or  not  a  pair  of  residues  should  be  aligned 
together  by  examining  the  (short)  subsequences,  referred  to 
as  winers,  that  are  centered  around  each  of  the  two  residues. 

Given  a  sequence  X  of  length  m  and  a  user-supplied  pa¬ 
rameter  w ,  the  wrner  at  position  i  of  X  (w  <  i  <  m  —  w) 
is  defined  to  be  the  (2w  +  1) -length  subsequence  of  X  cen¬ 
tered  at  position  i.  That  is,  the  turner  contains  xt,  the  w 
amino  acids  before,  and  the  w  amino  acids  after  Xi.  A  pair 
of  turners  are  compared  by  computing  their  ungapped  align¬ 
ment  scores.  Given  two  sequences  X  and  Y,  the  ungapped 
alignment  score,  wscore(xj,  t/j),  between  a  pair  of  turners  at 
positions  i  and  j  of  X  and  Y,  respectively  is  given  by 

W 

wscore(x*,t/j)  =  ^  Sx,r{i  +  k,j  +  k),  (2) 

k—  —  w 

where  Sx,y  (*  +  k,  j  +  k)  is  the  alignment  score  between  Xi+k 
and  ijj+k  and  is  computed  using  Equation  1 . 


2.2.1  Central  Alignment  Scheme  (CA).  This  is  the 
simplest  alignment  algorithm  that  we  developed  and  com¬ 
putes  the  alignment  by  progressively  aligning  the  pairs  of 
residues  that  have  the  highest  positive  wscore  values  subject 
to  the  constraint  that  they  do  not  conflict  with  the  portion  of 
the  alignment  that  has  been  constructed  thus  far. 

Specifically,  given  two  sequences  X  and  Y  of  length  m 
and  n,  respectively  and  a  value  for  w,  it  starts  by  computing 
the  set  Sw  of  residue-pairs  that  are  candidates  for  inclusion  in 
the  alignment  by  considering  only  the  pairs  that  have  positive 
wscore  values.  That  is, 

Sw  =  {(xi,yj)  |  ■mcore(xi,yj)  >  0},  (3) 

where  w  <  i  <  m  —  w  and  w  <  j  <  n  —  w.  Then  it  per¬ 
forms  a  series  of  iterations  in  which  it  performs  the  following 
three  steps:  First,  it  extracts  from  Sw  the  residue-pair  with  the 
highest  wscore  value  ( *):  Second,  it  aligns  Xi *  against 
yj*:  Third,  it  removes  from  Sw  all  residue-pairs  that  cannot 
be  part  of  a  valid  alignment  given  that  xt*  and  y3>  have  been 
aligned  with  each  other.  This  process  terminates  when  Sw 
becomes  empty.  Positions  that  do  not  belong  to  any  of  the 
selected  residue  pairs  are  left  unaligned  (i.e.,  aligned  against 
spaces). 

The  residue  pairs  that  need  to  be  removed  are:  (i)  (xc ,  yf) 
VZ,  (ii)  (. xk,yj *)  VZc,  (iii)  ( xk,yi )  V(fc  >  i*  A  Z  <  j*),  and 
(iv)  (xk,yi)  V(Zc  <  i*  A  Z  >  j*).  The  first  two  conditions  re¬ 
move  from  Sw  all  residue -pairs  involving  ,u>  or  y3  - ,  as  these 
positions  have  now  been  aligned,  whereas  the  last  two  condi¬ 
tions  remove  the  residue-pairs  that  if  aligned,  will  introduce 
inversions  in  the  alignment. 

2.2.2  Subset  Alignment  Scheme  (SA).  A  limitation 
of  the  central  alignment  scheme  is  that  it  may  leave  a  large 
number  of  residues  unaligned  because  (i)  it  only  considers 
the  residue-pairs  with  positive  wscores,  and  (ii)  it  will  not 
align  the  first  and  last  w  positions  of  the  two  sequences  (Sw 
contains  only  pairs  involving  interior  residues). 

To  address  this  problem  we  developed  the  subset  align¬ 
ment  scheme  (SA),  which  can  be  considered  an  extension  to 
the  CA  scheme.  Specifically,  the  SA  scheme  modifies  the 
second  and  third  steps  of  the  CA  algorithm  as  follows.  Dur¬ 
ing  the  second  step,  in  addition  to  including  the 
pair  in  the  alignment,  it  also  includes  in  the  alignment  all  pre¬ 
viously  unaligned  residue-pairs  of  the  form  (xi*+k,Vj*+k) 
for  —  w  <  k  <  +w.  That  is,  it  can  potentially  include  all 
residue-pairs  involved  in  (xj*,t/j*)’s  wmer.  Note  that  due 
to  the  incremental  nature  of  the  algorithm,  the  second  step 
essentially  extends  the  alignment  around  the  (xi*+kiUj*+k) 
residue-pair  until  it  encounters  a  residue  (from  either  X  or 
Y)  that  has  already  been  aligned.  We  will  refer  to  this  as  the 
alignment  extension  operation.  During  the  third  step  the  SA 
algorithm  removes  from  Sw  all  residue-pairs  that  are  now  in 
conflict  with  all  aligned  residue-pairs  that  were  selected  in 
second  step. 
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2.2.3  Central  and  Subset  Alignment  Scheme 
(CSA).  A  potential  problem  with  the  SA  scheme,  is  that  it 
may  align  a  pair  of  residues  (a ;,»+&,  Vj*+k)  with  each  other, 
even  when  Sw  contains  residue-pairs  with  higher  tuscore  val¬ 
ues  for  either  or  both  of  the  two  residues.  This  happens,  be¬ 
cause  SA’s  alignment  extension  operation  extends  the  align¬ 
ment  as  soon  as  it  extracts  the  highest  scoring  residue  pair 
from  Sw  and  there  may  be  some  higher-scoring  turners  for 
these  positions  in  Sw. 

For  this  reason,  we  developed  a  hybrid  scheme  that  com¬ 
bines  the  CA  and  SA  approaches.  Specifically,  the  new 
scheme  first  computes  a  CA  alignment  and  then  augments  it 
by  applying  the  alignment  extension  approach  used  by  SA  to 
each  pair  of  its  aligned  residues. 

2.2.4  Variable  turner  Alignment  Scheme.  The  align¬ 
ment  schemes,  CA,  SA,  and  CSA  were  discussed  in  the  con¬ 
text  of  a  fixed  length  wmer.  The  potential  drawback  of  this 
scheme  is  that  if  tu  is  set  to  a  relatively  large  value,  it  may  fail 
to  identify  positive  scoring  subsequences;  whereas  if  it  is  set 
too  low,  it  may  fail  to  reward  residue-pairs  that  have  relatively 
long  similar  subsequences. 

For  this  reason  we  extended  the  algorithms  to  also  operate 
with  variable  length  turners.  The  key  difference  from  the  use 
of  fixed  length  turners  centered  around  residue  pairs  Xi  and 
Uj  is  the  fact  that  we  define  length  tu*  in  the  range  of  1  to  w, 
such  that 

w*  =  argmax  K.sccne(xi,yj),  (4) 

K=  1 

where  /Cscore  is  the  (2/C  +  1)— subsequence  score  as  defined 
in  Equation  2. 

Our  alignment  schemes  start  by  computing  the  set  S'  w  of 
residue  pairs  that  are  candidates  for  inclusion  in  the  alignment 
by  considering  only  pairs  that  have  positive  tu*score  values. 
With  this  change  all  steps  of  our  alignment  algorithms  remain 
same.  Note  that  the  SA  scheme  using  the  variable  length 
turners  will  have  its  alignment  extension  operation  extended 
till  a  maximum  length  of  tu*. 

As  a  notation  reference  we  denote  the  variable  turner 
alignment  algorithms  by  CA*\  S A’6' ,  and  CSA1'  to  distinguish 
them  from  the  fixed  turner  alignment  algorithms  denoted  in 
this  study  by  CA7 ,  SA7,  and  CSA7. 

3  Materials 

3.1  Evaluation  Methodologies  and  Metrics 

We  evaluated  the  performance  of  the  proposed  window-based 
alignment  algorithms  by  considering  (i)  the  quality  of  the 
alignment  itself  and  (ii)  the  extent  to  which  the  inherent  or¬ 
dering  of  the  aligned  pairs  of  residues  can  be  used  to  identify 
portions  of  the  alignment  that  are  more  reliable  than  others. 
In  order  to  assess  alignment  quality  we  used  two  widely  used 
methodologies,  often  referred  to  as  template-based  [7]  and 
model-based  [8],  whereas  the  reliability  was  assessed  by  fol¬ 
lowing  a  methodology  that  was  recently  proposed  in  the  con¬ 
text  of  comparative  modeling  [36]. 


3.1.1  Template-based  Approach.  The  first  method 
for  evaluating  alignment  quality  compares  the  differences  be¬ 
tween  the  alignment  generated  to  template  alignments  [7,  31, 
8].  These  template  alignments  are  generally  derived  from  var¬ 
ious  structural  alignment  programs  and  are  considered  to  be 
the  gold  standard. 

We  use  three  quality  scores,  namely  the  developer’s  score 
(/d)  [31],  the  modeler’s  score  (Jm)  [31]  and  the  Cline  score 
(CS)  [4]  to  compare  the  template  alignments  with  the  gen¬ 
erated  alignments.  The  developer’s  score  is  the  number  of 
correctly  aligned  residue  pairs  in  the  generated  alignment  di¬ 
vided  by  the  length  of  the  template  alignment.  (The  length 
of  an  alignment  is  defined  as  the  number  of  aligned  residue 
pairs.)  The  modeler’s  score  computes  the  ratio  of  correctly 
aligned  residue  pairs  with  the  length  of  the  generated  align¬ 
ment.  The  Cline  score  was  developed  to  address  the  issues 
with  Jm  and  fo  by  penalizing  both  under-alignment  and 
over-alignment,  and  also  crediting  regions  in  the  generated 
alignment  that  may  be  shifted  by  a  few  positions  relative  to 
the  reference  alignment  [7,  4].  The  steps  for  computation  of 
the  Cline  score  can  be  found  in  the  study  [4], 

Note  that  the  fo  and  /m  scores  are  equivalent  to  the  more 
traditional  measures  of  recall  and  precision  [9],  respectively 
that  are  used  extensively  to  measure  prediction  performance. 
In  the  rest  of  the  discussion  we  will  primarily  refer  to  /D 
and  Jm  by  the  more  intuitive  names  of  recall  and  precision, 
respectively. 

3.1.2  Model-based  Approach.  An  alternative  to  us¬ 
ing  a  template-based  approach  is  to  build  a  structural  model 
from  the  alignment  and  evaluate  the  similarity  between  the 
model  and  the  template  structure  [8,  28].  Starting  from  the 
alignment  between  a  pair  of  proteins  (one  protein  considered 
to  be  the  query  protein,  the  second  considered  to  be  the  tar¬ 
get  protein  whose  3D  structure  is  known),  a  model  protein 
is  created  which  consists  of  the  carbon  alpha,  Ca  atoms  of 
the  query  protein.  The  atomic  coordinates  of  this  model  pro¬ 
tein  are  the  atomic  coordinates  of  the  target  protein  i.e.,  for 
every  aligned  pair  of  residues,  the  query  protein  has  its  Ca 
atomic  coordinates  replaced  by  the  corresponding  atomic  co¬ 
ordinates  of  the  target  protein.  The  similarity  between  the 
two  structures  (the  model  protein  and  target  protein)  after  a 
structural  super-imposition  [23],  is  used  as  an  assessment  of 
sequence  alignment  quality. 

In  our  study,  we  computed  this  similarity  using  the 
LGscore  [5]  that  takes  into  account  the  common  segments 
between  the  pair  of  proteins.  LGscore  computes  the  sim¬ 
ilarity  between  two  protein  structures  (model  and  template 
structure)  based  on  the  common  segments  between  them.  It 
is  desirable  to  have  long  common  segments  with  high  struc¬ 
tural  similarity.  The  LGscore  measure  was  used  to  evalu¬ 
ate  the  structures  obtained  by  threading  methods  [28]  in  the 
CAFASP2  [10]  and  LiveBench  [3]  experiments  as  well  as  a 
sequence  alignment  quality  measure  [8]. 

Note  that  instead  of  LGscore  other  structural  similarity 
methods  or  protein  modeling  assessment  measures  can  be 
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used  for  evaluating  the  quality  of  the  model  (e.g  rmsd  mea¬ 
sure  [19],  global  distance  test  score  (GDT)  [38]  and  Max- 
Sub  [34]).  However,  for  this  study  we  show  only  the  re¬ 
sults  using  the  LGscore  method  due  to  similarity  in  results 
obtained  when  tested  with  the  other  measures. 

3.1.3  Reliability  of  Aligned  Regions.  In  compara¬ 
tive  modeling  and  several  other  applications,  it  is  essential  not 
only  to  align  residue  pairs  but  also  to  provide  some  reliabil¬ 
ity  index  or  confidence  measure  associated  with  the  aligned 
residue  pairs.  While  building  protein  structure  models  us¬ 
ing  comparative  modeling  strategies  it  is  important  to  include 
only  those  regions  where  the  alignment  is  considered  to  be 
good  or  reliable  [17,  4,  32,  25,  36]. 

One  of  the  reliability  assessment  measures  calculated  a 
smoothed  profile-derived  alignment  score.  The  score  for  each 
of  the  aligned  residue  in  the  template  alignment  was  com¬ 
puted  using  a  triangular  smoothing  window  of  size  5.  The 
reliability  was  assessed  by  setting  up  a  threshold  value  for  the 
smoothed  profile-derived  score  [36].  Our  approach  for  relia¬ 
bility  assessment  was  very  similar  to  this  method. 

Using  the  template-based  benchmarks  we  evaluated  the  re¬ 
liability  of  the  aligned  residue  pairs  by  ranking  the  aligned 
pairs  in  the  query  alignment.  We  score  the  aligned  positions 
using  fixed  length  ruscores.  The  reliability  measure  is  com¬ 
puted  as  the  recall  at  different  percent  levels  of  incorrectly 
aligned  residue  pairs  (up  to  5%).  The  notion  of  a  hit  is  de¬ 
fined  as  having  the  same  aligned  residue  pairs  in  both  the 
query  and  template  alignments.  The  difference  in  our  relia¬ 
bility  scheme  was  the  use  of  a  profile-to-profile  scoring  func¬ 
tions  equally  weighted  at  all  positions  of  the  wmer  rather  than 
using  a  smoothing  wmer  [36]. 

3.2  Datasets 

For  the  template-based  assessment  scheme  we  used  a  dataset 
created  to  evaluate  the  various  profile-to-profile  scoring  func¬ 
tions  for  protein  sequence  alignment  [7].  The  dataset  consists 
of  588  reference  alignment  pairs  having  high  structural  sim¬ 
ilarity  but  low  sequence  identity  (<  30%).  This  dataset  was 
selected  to  have  a  high  pairwise  structural  similarity  using  the 
consensus  of  FSSP  [16]  and  CE  [33], 

For  the  model-based  evaluation  scheme,  we  used  a  bench¬ 
mark  created  from  SCOP  1.39  filtered  to  only  contain  do¬ 
mains  with  less  than  50%  pairwise  sequence  identity  [8].  This 
dataset  contains  of  9983  protein  domain  pairs,  such  that  1903 
belong  to  the  same  families,  3101  share  only  the  same  super¬ 
family,  and  4979  share  only  the  same  fold.  Due  to  the  non- 
symmetrical  nature  of  models  built  from  alignments,  each 
pair  of  sequences  were  evaluated  twice — leading  to  a  bench¬ 
mark  of  19966  domain  pairs. 

3.3  Profile  Generation 

The  position  specific  score  and  frequency  matrices  used  by 
the  profile -based  scoring  method  of  Equation  1  were  gen¬ 
erated  using  the  latest  version  of  the  PSI-BLAST  algorithm 
(available  in  NCBI’s  blast  release  2.2.10),  and  were  derived 


from  the  multiple  sequence  alignment  constructed  after  five 
iterations  using  an  e  value  of  10-3.  The  PSI-BLAST  was 
performed  against  NCBI’s  nr  database  that  was  downloaded 
in  November  of  2004  and  contained  2,171,938  sequences. 

In  the  case  in  which  PSI-BLAST  could  not  produce  mean¬ 
ingful  alignments  for  certain  positions  of  the  query  sequence, 
the  corresponding  rows  of  the  two  matrices  are  derived  from 
the  scores  and  frequencies  of  BLOSUM62. 

4  Results 

In  this  section,  we  evaluate  the  performance  of  the  incre¬ 
mental  window  based  alignment  schemes  using  the  various 
benchmark  datasets  and  evaluation  metrics  discusses  in  Sec¬ 
tion  3. 

4.1  Assessment  of  Incremental  Window- 
based  Alignments 

Table  1  provides  an  extensive  set  of  results  illustrating  the 
performance  of  the  CA,  SA,  and  CSA  schemes  on  the 
template-based  dataset  for  different  values  of  w  and  for  fixed- 
and  variable -length  uimers.  Note  that  the  column  labeled 
“G5<15%”  shows  the  CS  results  for  the  subset  of  sequence- 
pairs  that  have  less  than  15%  sequence  identity  (i.e.,  a  subset 
that  is  inherently  harder  to  align  well). 

4.1.1  Central  vs  Subset  vs  Combined.  The  results 
of  Table  1  show  that  with  respect  to  the  CS  scores,  SA  tends  to 
perform  better  than  either  CA  or  CSA,  whereas  CA  performs 
consistently  the  worst.  The  only  exception  is  for  variable- 
length  wmers,  in  which  SA’s  performance  is  comparable  to 
that  of  CSA.  The  relative  advantage  of  SA  is  more  evident  if 
we  consider  the  subset  of  sequence-pairs  with  less  that  15% 
sequence  identity,  for  which  its  CS  scores  are  consistently 
higher  than  those  achieved  by  the  other  schemes  (SA  achieves 
a  score  of  0.649  whereas  CA  and  CSA  achieves  scores  of 
0.614  and  0.628,  respectively). 

By  looking  at  the  performance  of  the  various  schemes  in 
terms  of  recall,  we  can  see  that  SA’s  higher  CS-based  perfor¬ 
mance  is  due  to  the  fact  that  it  achieves  significantly  better  re¬ 
call  values  than  the  other  schemes.  This  was  to  be  expected, 
as  it  was  one  of  the  motivation  behind  the  development  of 
SA.  Also,  the  precision-based  results  show  that  CA  achieves 
somewhat  better  precisions  than  CSA,  whereas  SA’s  preci¬ 
sion  is  comparable  or  better  to  that  of  the  other  schemes. 

4.1.2  Fixed  vs  Variable  Length  Alignments.  Ana¬ 
lyzing  the  performance  of  alignment  methods  that  use  fixed 
length  wmers  compared  to  the  methods  that  use  variable 
length  winters,  we  notice  that  for  the  CA  and  CSA  schemes, 
for  the  same  turner  length  the  recall  as  well  as  the  precision 
scores  have  higher  values.  Note  that  the  higher  recall  is  ex¬ 
pected,  because  the  methods  using  a  variable  wmer  size  win¬ 
dow  will  have  a  higher  flexibility  in  allowing  larger  number  of 
turners  (with  a  positive  score)  to  be  picked  for  the  candidate 
set  S'  w. 

Another  key  observation  is  that  SA-f  performs  better  in 
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terms  of  recall  than  S A'J .  This  is  because  for  the  same  value 
of  tu ,  the  tu*  value  selected  by  SA1"  may  be  smaller  than  tu 
(i.e.,  the  value  used  by  SA-^).  As  a  result,  SA-^’s  alignment 
extension  operations  will  involve  longer  windows,  which  can 
produce  longer  alignments  than  SAL  and  thus  higher  recall 
values. 


Table  1:  Alignment  Accuracy  Results  on  a  Template-based 
Dataset. 


f M 

(precision) 

Id 

(recall) 

CS 

CS<  15% 

fixed 

central 

turner  =  2 

0.805 

0.791 

0.803 

0.600 

turner  =  3 

0.799 

0.776 

0.794 

0.596 

turner  =  4 

0.791 

0.756 

0.782 

0.587 

turner  =  5 

0.776 

0.732 

0.764 

0.572 

subset 

turner  =  2 

0.802 

0.835 

0.826 

0.626 

turner  =  3 

0.805 

0.842 

0.831 

0.642 

turner  =  4 

0.805 

0.842 

0.832 

0.644 

turner  =  5 

0.802 

0.838 

0.828 

0.649 

combined 

turner  =  2 

0.791 

0.822 

0.816 

0.619 

turner  =  3 

0.785 

0.819 

0.814 

0.623 

turner  =  4 

0.779 

0.811 

0.808 

0.624 

turner  =  5 

0.767 

0.798 

0.798 

0.624 

variable 

central 

turner  =  2 

0.799 

0.804 

0.809 

0.595 

turner  =  3 

0.802 

0.807 

0.812 

0.605 

turner  =  4 

0.805 

0.797 

0.810 

0.611 

turner  =  5 

0.805 

0.797 

0.807 

0.614 

subset 

turner  =  2 

0.798 

0.827 

0.820 

0.615 

turner  =  3 

0.798 

0.834 

0.825 

0.629 

turner  =  4 

0.798 

0.836 

0.827 

0.634 

■turner  =  5 

0.794 

0.832 

0.823 

0.636 

combined 

turner  =  2 

0.795 

0.822 

0.813 

0.600 

turner  =  3 

0.797 

0.827 

0.820 

0.614 

turner  =  4 

0.800 

0.831 

0.824 

0.621 

turner  =  5 

0.800 

0.832 

0.825 

0.628 

In  the  table  Jm  denotes  the  Modeler’s  score,  /D  denotes 
the  Developer’s  score,  CS  denotes  the  Cline  score,  and 
CS<  15%  denotes  the  Cline  score  for  a  subset  of  sequence 
pairs  sharing  less  than  15%  sequence  identity. 

4.1.3  Sensitivity  of  Schemes  with  respect  to  vary¬ 
ing  wmer  size  Looking  at  the  performance  achieved  by 
the  various  schemes  in  Table  I  as  tu  ranges  from  two  to  five, 
we  see  that  in  general,  SA’s  and  CSA’s  performance  does 


not  significantly  change  (e.g.,  CS  scores  stay  within  a  tight 
range),  whereas  C A-^  ’s  performance  tend  to  deteriorate  with 
increasing  tu.  This  latter  behavior  is  due  to  the  fact  that  as 
we  increase  the  turner  size,  fewer  turners  will  have  a  posi¬ 
tive  score  and  hence  will  not  be  included  as  part  of  the  set 
Sv, .  We  see  a  direct  effect  of  this  leading  to  a  decrease  in 
the  recall  scores.  Also  increase  in  the  turner  size  does  lead 
to  a  decrease  in  precision  score  as  well.  This  is  because  for  a 
larger  turner  window  the  positive  scoring  turners  may  not  be 
due  to  the  more  “central”  positions.  Evidence  of  this  can  be 
seen  by  comparing  the  behavior  of  the  CAV  scheme  in  which 
both  the  precision  and  recall  scores  stay  the  same. 

Another  key  observation  is  that  the  schemes  that  utilize 
variable  length  turners  tend  to  perform  better  for  larger  values 
of  tu.  This  is  because  of  the  flexibility  associated  with  using 
a  variable  length  turner. 

4.1.4  Alternative  Performance  Assessment  For 

this  dataset  too,  we  performed  a  thorough  parameter  study 
by  varying  turner  lengths  for  our  alignment  schemes.  We  ob¬ 
served  similar  results  as  seen  in  T:B1  for  the  template-based 
dataset.  In  Table  2  we  report  only  the  best  results  achieved 
rather  than  showing  results  for  varying  turner  sizes  as  done  in 
Table  1. 

Firstly,  we  notice  the  difference  in  the  LGscore  values  for 
the  family,  superfamily  and  fold  pairs  clearly  showing  the  dif¬ 
ficulty  nature  of  the  three  sets  of  problems,  with  the  fold-pairs 
being  the  hardest  to  model  followed  by  the  superfamily  and 
family  level  pairs. 

Similar  to  the  template-based  results,  the  SA  scheme  has 
the  best  LGscore  at  the  family,  superfamily  and  fold  levels  for 
both  the  variable  and  fixed  turner  setting.  A  surprising  fact 
was  that  the  performance  results  as  measured  by  the  LGscore 
did  not  decrease  with  increasing  turner  lengths.  In  fact,  we 
observed  that  the  use  of  a  higher  turner  size  of  5  for  the  fixed 
length  scheme  achieved  the  best  results  of  1 .53  and  4.29  for 
the  fold  and  superfamily  level  problems.  We  also  observe 
slightly  better  performance  for  the  variable  turner  schemes 
compared  to  the  fixed  turner  schemes. 

The  performance  of  the  CSAV  alignment  method  was  the 
lowest  for  both  the  family  and  superfamily  level  pairs  which 
contrasts  the  results  seen  previously  on  the  template-based 
dataset  in  Table  1 . 

4.2  Comparison  with  Earlier  Results 

4.2.1  Template-based  Benchmark.  Table  3  shows 
the  comparative  performance  of  our  window  based  schemes 
against  some  of  the  best  profile-to-profile  scoring  techniques 
studied  previously  [7].  In  the  table  we  show  results  for  the 
schemes  pdotp,  correlp  and  coach,  pdotp  uses  dot  product 
to  compute  the  similarity  between  two  profiles,  correlp  com¬ 
putes  the  Pearson  correlation  between  the  profile  columns, 
whereas  coach  [6]  uses  an  asymmetrical  complex  dot  product 
between  the  HMM  profile  and  a  position  frequency  matrix. 

We  show  results  of  these  schemes  as  published  previ¬ 
ously  [7]  using  SAM  T99  profiles  (The  performance  of  these 
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Table  2:  Alignment  Accuracy  Results  on  a  Model- 
based  Dataset. 


Alignment  Scheme 

Family  Superfamily 

Fold 

CA'  (2) 

14.86 

1.66 

0.04 

SAf  (5) 

16.44 

4.29 

1.53 

CSA/  (2) 

15.47 

2.53 

0.203 

CAV  (5) 

15.10 

2.43 

0.12 

SA'°  (5) 

16.48 

4.05 

1.05 

CSAV  (5) 

14.05 

2.32 

0.14 

The  numbers  in  the  parameter  indicate  the  wmer 
length  for  the  various  alignment  schemes. 

alignment  methods  using  SAM  T99  profiles  is  3-4%  better 
than  the  PSI-BLAST  based  profiles  [7])  Our  methods  show 
comparable  performance  to  these  alignment  methods  using 
SAM  T99  templates. 

We  also  compare  the  results  of  the  window  based  align¬ 
ment  methods  to  a  local  Smith-Waterman  [35]  alignment  al¬ 
gorithm  implementation  (SW-PSSM)  using  the  same  profile- 
to-profile  scoring  function  as  used  for  the  window  based 
alignments  (Equation  1).  Within  this  local  alignment  frame¬ 
work  we  use  an  affine  gap  model  along  with  a  zero-shift  pa¬ 
rameter  [37]  to  maintain  certain  necessary  requirements  of  a 
good  optimal  alignment.  We  optimize  the  gap  modeling  pa¬ 
rameters  (gap  opening  (go),  gap  extension  ( ge ))  and  the  zero 
shift  value  (zs)  to  obtain  highly  optimal  alignments  for  com¬ 
parative  purposes. 

We  observe  in  Table  3  that  the  incremental  window- 
based  alignment  schemes  perform  very  competitively  when 
compared  to  our  fully  optimized  SW-PSSM  implementation. 
Also  notice  the  superiority  of  our  optimized  SW-PSSM  im¬ 
plementation  to  the  alignment  methods  using  pdotp,  correlp 
and  coach  as  their  profile -profile  scoring  functions.  The  dif¬ 
ference  in  the  SW-PSSM  results  with  the  other  standard  align¬ 
ment  techniques  may  be  due  to  the  use  of  a  more  sensitive 
PICASSO  based  profile-to-profile  scoring  function.  Further, 
these  results  verify  that  we  are  comparing  our  novel  win¬ 
dow  based  alignment  methods  to  a  fully  optimized  SW-PSSM 
alignment  algorithm. 

The  performance  of  the  window-based  scheme  is  actu¬ 
ally  very  promising.  We  select  one  of  the  better  performing 
schemes  (SA^)  and  compare  it  to  the  optimized  SW-PSSM 
algorithm  using  the  CS  score.  Figure  1  shows  that  the  com¬ 
parative  performance  of  the  two  methods  across  the  588 
alignment  pairs  in  the  dataset. 

4.2.2  Model-based  Benchmark.  Our  results  in  Ta¬ 
ble  4  reiterate  the  closeness  in  performance  of  the  incremen¬ 
tal  window  based  alignment  method  to  the  highly  optimized 
SW-PSSM  alignment  algorithm  for  the  family,  superfamily 
and  fold  level  subsets. 

Table  4  also  shows  results  for  the  optimized  local  (local 
sequence  alignment  using  a  global  scoring  matrix),  global 


Central  Alignment  Scheme  (Fixed  Length  wmer  =  3) 

Figure  1:  Cline  Score  Comparison  of  SW-PSSM  scheme 
against  SA^  scheme  for  the  588  alignment  pairs  in  the 
template-based  dataset 


Table  3:  Comparative  Performance  with  Earlier  Results 
on  Template-based  Dataset. 


Alignment  Scheme 

/m 

fa 

CS 

cs<15% 

SAf  (3) 

0.805 

0.842 

0.831 

0.642 

SA"  (4) 

0.798 

0.836 

0.827 

0.634 

SW-PSSM 

0.803 

0.852 

0.841 

0.689 

pdotp  (T99) 

0.806 

0.829 

0.832 

0.697 

correlp  (T99) 

0.794 

0.835 

0.829 

0.702 

coach  (T99) 

0.797 

0.830 

0.829 

0.697 

The  optimized  SW-PSSM  results  are  achieved  us¬ 
ing  go  =  3.0,  ge  =  0.75,  zs  =  1.0.  In  the  table 
pdotp,  correlp,  coach  use  a  dot  product,  correlation 
function,  and  a  HMM  based  profile-profile  scoring 
function.  T99  denotes  the  use  of  SAM  T99  based 
profiles  respectively. 

(global  sequence  alignment  using  a  global  scoring  matrix), 
PSI  (3D-PSSM  [20]  based  global  sequence  alignment  against 
a  profile  [11]  obtained  from  PSI-BFAST),  SSPSI  [8](3D- 
PSSM  based  global  sequence  alignment  against  a  profile 
obtained  from  PSI-BFAST  using  secondary  structure  in¬ 
formation)  and  structural  (alignment  using  structural  super¬ 
imposition  by  lgscore2)  alignment  methods  published  previ¬ 
ously  [8].  The  structural  alignment  sets  up  a  higher  reference 
quality  score  for  the  benchmark.  Using  sequence  alignment 
techniques  we  would  like  to  achieve  these  high  levels  of  accu¬ 
racy.  The  results  shown  in  Table  4  for  the  various  previously 
published  schemes,  as  well  as  for  our  methods  are  the  best 
achieved  after  optimization  of  the  various  parameters. 

We  further  analyze  the  data  by  annotating  a  model  as  being 
correct  based  on  the  FGscore  value.  As  done  in  the  study  [8] 
we  use  the  less  strict  FGscore  cutoff  (10-3)  to  define  a  correct 
model  and  a  more  stringent  cutoff  (10-5)  to  identify  models 
of  higher  quality.  The  percentage  of  models  correct  based 
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on  these  cutoffs  are  shown  in  Table  5.  Both  the  incremental 
window-based  alignment  methods,  as  well  as  the  SW-PSSM 
alignment  method,  are  able  to  pick  the  correct  models  with 
similar  degrees  of  accuracy.  Our  techniques  also  seem  to 
identify  a  higher  percentage  of  correct  models  when  com¬ 
pared  to  the  previously  studied  schemes,  especially  PSI  and 
SSPSI,  both  of  which  also  incorporate  some  profile  informa¬ 
tion.  As  seen  from  Table  5  our  methods  are  able  to  pick  a 
larger  fraction  of  higher  quality  models  for  the  family  and 
superfamily  levels. 

4.2.3  Reliability  Performance.  Table  6  shows  the 
reliability  performance  for  the  window  based  alignment 
schemes  in  comparison  to  the  optimized  SW-PSSM  based 
alignment  scheme.  These  results  correspond  to  the  average 
recall  scores  obtained  for  all  the  alignment  pairs  at  different 
error  rates  using  the  procedure  described  in  Section  3.1.3. 

Though  the  SW-PSSM  algorithm  showed  slightly  better 
performance  in  terms  of  the  overall  alignment  quality  (Ta¬ 
ble  3  and  Table  4),  it  is  interesting  to  note  the  window-based 
schemes  using  variable  length  umers  showed  far  better  per¬ 
formance  at  the  lower  error  rates.  In  particular  before  see¬ 
ing  any  incorrect  predictions  in  the  ranked  aligned  positions, 
the  alignment  methods  using  variable  length  umers  have  a 
recall  around  0.260  compared  to  the  recall  of  0.205  for  the 
SW-PSSM  algorithm.  Note  that  the  recall  performance  of 
the  CSA  scheme  is  slightly  better  than  the  CA  scheme  and 
slightly  worse  compared  to  the  SA  alignment  scheme.  These 
results  can  be  explained  by  the  fact  that  the  high  scoring 
residue  pairs  aligned  by  CA  are  also  aligned  by  the  CSA 
scheme. 


Table  4:  Comparative  Performance  with  Earlier  Re¬ 
sults  on  a  Model-based  Dataset. 


Alignment  Scheme 

Family  Superfamily  Fold 

SA'  (5) 

16.44 

4.29 

1.53 

SA"  (5) 

16.48 

4.05 

1.05 

SW-PSSM 

16.66 

4.38 

2.02 

local 

14.1 

2.0 

0.7 

global 

15.1 

2.9 

1.4 

PSI 

15.8 

3.3 

1.4 

SSPSI 

16.0 

4.1 

2.6 

structural 

19.4 

9.1 

8.0 

The  optimized  SW-PSSM  results  are  achieved 
using  go  =  3.0,  ge  =  0.75,  zs  =  3.0.  All  the 
results  are  optimized  for  their  relevant  parame¬ 
ters 

5  Conclusion 

In  this  study  we  developed  algorithms  that  identify  the 
aligned  pairs  of  residues  using  an  incremental  approach. 
These  algorithms  capture  the  most  similar  pairs  of  subse¬ 
quences  as  part  of  the  final  alignment.  The  concepts  from 


Table  5:  Fraction  of  Correct  Models  based  on  the 
LGscore. 


FGscore 

<  10-3 

<  io- 

-5 

Alignment  Scheme 

Fm 

Sf 

Fd 

Fm 

Sf 

Fd 

SA'  (3) 

74 

27 

5 

55 

8 

0 

SA"  (3) 

74 

28 

4 

55 

8 

0 

SW-PSSM 

74 

27 

6 

56 

8 

0 

local 

66 

10 

1 

46 

2 

0 

global 

70 

12 

1 

49 

3 

0 

PSI 

72 

18 

4 

50 

4 

0 

SSPSI 

73 

21 

6 

53 

5 

0 

structural 

86 

60 

51 

66 

21 

21 

The  optimized  SW-PSSM  results 

i  are 

achieved  using 

go  =  3.0,  ge  =  0.75,  zs  =  3.0.  All  the  results  are  op¬ 
timized  for  their  relevant  parameters.  Fm,  Sf  and  Fd 
denote  the  family-level,  superfamily-level  and  fold- 
level  performance  results  respectively. 


Table  6:  Reliability  Assessment:  Recall  for  the  first  k%  er¬ 
rors. 


Method 

0% 

1% 

2% 

3% 

4% 

5% 

CA'  (3) 

0.176 

0.281 

0.365 

0.434 

0.494 

0.541 

SA'  (3) 

0.186 

0.297 

0.384 

0.459 

0.519 

0.563 

CSA'  (3) 

0.180 

0.286 

0.370 

0.438 

0.498 

0.545 

CA"  (3) 

0.254 

0.364 

0.450 

0.515 

0.566 

0.603 

SA"  (3) 

0.260 

0.368 

0.454 

0.521 

0.572 

0.612 

CSA"  (3) 

0.260 

0.367 

0.454 

0.520 

0.571 

0.610 

SW-PSSM 

0.205 

0.320 

0.405 

0.480 

0.541 

0.586 

The  optimized  SW-PSSM  results  are  achieved  using 
go  =  3.0,  ge  —  0.75,  zs  —  3.0.  The  numbers  in  the 
parenthesis  represent  the  wmer  width  used  for  the  results 
shown. 

string-kernel  theory  (use  of  ungapped  subsequences,  scored 
using  profiles)  played  an  integral  role  in  the  design  of  these 
alignment  algorithms. 

Our  comprehensive  experimental  study  on  the  template- 
based  and  model-based  benchmark  datasets  showed  com¬ 
parable  performance  to  a  fully  optimized  Smith- Waterman 
profile-based  implementation.  In  terms  of  the  reliability  per¬ 
formance  of  the  aligned  residue-pairs  we  notice  that  the  align¬ 
ment  schemes  using  variable  length  umers  had  very  promis¬ 
ing  results.  Amongst  the  window-based  schemes  we  no¬ 
ticed  that  the  subset  alignment,  SA  using  both  the  fixed  and 
variable  umers  showed  the  best  performance.  The  sensitiv¬ 
ity  analysis  done  by  varying  the  urner  size  showed  the  SA 
schemes  to  have  a  robust  performance. 

The  simplicity  of  our  methods  and  competitive  alignment 
quality  as  well  as  aligned  region  reliability  will  lead  to  the 
application  of  our  algorithms  in  key  bioinformatic  problems. 
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especially  comparative  modeling. 
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